This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
The purpose of this project is to do a comprehensive work on a provided wine data set. The work will involve application of correct statistical methods on the data and complete analysis of the data. It would also include correct and adequate interpretation and discussion on data, graphs, tables and results. The following areas were covered during the project work: 1. Importing wine data into R 2. Review of data in the text file 3. Cleaning of the data 4. Exploratory of data through visualization and 5. Drawing insights from the data.
Relevant Information: These data are the results of a chemical
analysis of wines grown in the same region in Italy but derived from
three different cultivars. The analysis determined the quantities of 13
constituents found in each of the three types of wines. The attributes
are 1) Alcohol 2) Malic acid 3) Ash 4) Alcalinity of ash
5) Magnesium 6) Total phenols 7) Flavanoids 8) Nonflavanoid phenols 9)
Proanthocyanins 10) Color intensity 11) Hue 12) OD280/OD315 of diluted
wines 13) Proline
Number of Instances of the variable Class. class 1 59 class 2 71 class 3
48 There are 13 predictor variables and 1 target variable. - 18 missing
data. - 1 misleading data and was treated as an outlier.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggrepel)
library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
library(tidyr)
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(olsrr)
##
## Attaching package: 'olsrr'
##
## The following object is masked from 'package:MASS':
##
## cement
##
## The following object is masked from 'package:datasets':
##
## rivers
library(stats)
library(dplyr)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(psych)
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
Importing of wine data in R
# Load data into R
initial_data <- read.table("wine.txt", sep = ",", header = FALSE)
initial_data
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
## 1 1 14.23 1.71 NA 15.6 127 2.80 3.06 0.28 2.29 5.64 1.040 3.92
## 2 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.050 3.40
## 3 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.030 3.17
## 4 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.860 3.45
## 5 1 13.24 2.59 2.87 21 118 2.80 2.69 0.39 1.82 4.32 1.040 2.93
## 6 1 14.20 1.76 2.45 15.2 112 3.27 3.39 0.34 1.97 6.75 1.050 2.85
## 7 1 14.39 1.87 2.45 14.6 96 2.50 2.52 0.30 1.98 5.25 1.020 3.58
## 8 1 14.06 2.15 2.61 17.6 121 2.60 2.51 0.31 1.25 5.05 1.060 3.58
## 9 1 14.83 1.64 2.17 14 97 2.80 2.98 0.29 1.98 5.20 1.080 2.85
## 10 1 13.86 1.35 2.27 16 98 2.98 3.15 0.22 1.85 7.22 1.010 3.55
## 11 1 14.10 2.16 2.30 18 105 2.95 3.32 0.22 2.38 5.75 1.250 3.17
## 12 1 14.12 1.48 2.32 16.8 95 2.20 2.43 0.26 1.57 5.00 1.170 2.82
## 13 1 13.75 1.73 2.41 16 89 2.60 2.76 0.29 1.81 5.60 1.150 2.90
## 14 1 14.75 1.73 2.39 11.4 91 3.10 3.69 0.43 2.81 5.40 1.250 2.73
## 15 1 14.38 1.87 2.38 12 102 3.30 3.64 0.29 2.96 7.50 1.200 3.00
## 16 1 13.63 1.81 2.70 17.2 112 2.85 2.91 0.30 1.46 7.30 1.280 2.88
## 17 1 14.30 1.92 2.72 20 120 2.80 3.14 0.33 1.97 6.20 1.070 2.65
## 18 1 13.83 1.57 2.62 20 115 2.95 3.40 0.40 1.72 6.60 1.130 2.57
## 19 1 14.19 1.59 2.48 16.5 108 3.30 3.93 0.32 1.86 8.70 1.230 2.82
## 20 1 13.64 3.1 2.56 15.2 116 2.70 3.03 0.17 1.66 5.10 0.960 3.36
## 21 1 14.06 1.63 2.28 16 126 3.00 3.17 0.24 2.1 5.65 1.090 3.71
## 22 1 12.93 3.8 2.65 18.6 . 2.41 2.41 0.25 1.98 4.50 1.030 3.52
## 23 1 13.71 1.86 2.36 16.6 101 2.61 2.88 0.27 1.69 3.80 1.110 4.00
## 24 1 12.85 1.6 2.52 17.8 95 NA 2.37 0.26 1.46 3.93 1.090 3.63
## 25 1 13.50 1.81 2.61 20 96 2.53 2.61 0.28 1.66 3.52 1.120 3.82
## 26 1 13.05 2.05 3.22 25 124 2.63 2.68 0.47 1.92 3.58 1.130 3.20
## 27 1 13.39 1.77 2.62 16.1 93 2.85 2.94 0.34 1.45 4.80 0.920 NA
## 28 1 13.30 1.72 2.14 17 94 2.40 2.19 0.27 1.35 3.95 1.020 2.77
## 29 1 13.87 1.9 2.80 19.4 107 2.95 2.97 0.37 1.76 4.50 1.250 3.40
## 30 1 14.02 1.68 2.21 16 96 2.65 2.33 0.26 1.98 4.70 1.040 3.59
## 31 1 13.73 1.5 2.70 22.5 101 3.00 3.25 0.29 2.38 5.70 1.190 2.71
## 32 1 13.58 1.66 2.36 19.1 106 2.86 3.19 0.22 1.95 6.90 1.090 2.88
## 33 1 13.68 1.83 2.36 17.2 104 2.42 2.69 0.42 1.97 3.84 1.230 2.87
## 34 1 13.76 1.53 2.70 19.5 132 2.95 2.74 0.50 1.35 5.40 1.250 3.00
## 35 1 13.51 1.8 2.65 19 110 2.35 2.53 0.29 1.54 4.20 1.100 2.87
## 36 1 13.48 2.41 NULL 100 2.70 2.98 0.26 1.86 5.10 1.040 3.47
## 37 1 13.28 1.64 2.84 15.5 110 2.60 2.68 0.34 1.36 4.60 1.090 2.78
## 38 1 13.05 1.65 2.55 18 98 2.45 2.43 0.29 1.44 4.25 1.120 2.51
## 39 1 13.07 1.5 2.10 15.5 98 2.40 2.64 0.28 1.37 3.70 1.180 2.69
## 40 1 14.22 3.99 2.51 13.2 128 3.00 3.04 0.20 2.08 5.10 0.890 3.53
## 41 1 13.56 1.71 NA 16.2 117 3.15 NA 0.34 2.34 6.13 0.950 3.38
## 42 1 13.41 3.84 2.12 18.8 90 2.45 2.68 0.27 1.48 4.28 0.910 3.00
## 43 1 13.88 1.89 2.59 15 101 3.25 3.56 0.17 1.7 5.43 0.880 3.56
## 44 1 13.24 3.98 2.29 17.5 103 2.64 2.63 0.32 1.66 4.36 0.820 3.00
## 45 1 13.05 1.77 2.10 17 107 3.00 3.00 0.28 2.03 5.04 0.880 3.35
## 46 1 14.21 4.04 2.44 18.9 111 2.85 2.65 0.30 1.25 5.24 0.870 3.33
## 47 1 14.38 3.59 2.28 16 102 3.25 3.17 0.27 2.19 4.90 1.040 3.44
## 48 1 13.90 1.68 2.12 101 3.10 3.39 0.21 2.14 6.10 0.910 3.33
## 49 1 14.10 2.02 2.40 18.8 . 2.75 2.92 0.32 na 6.20 1.070 2.75
## 50 1 13.94 1.73 2.27 17.4 108 2.88 3.54 0.32 2.08 8.90 1.120 3.10
## 51 1 13.05 1.73 2.04 12.4 92 2.72 3.27 0.17 2.91 7.20 1.120 2.91
## 52 1 13.83 1.65 2.60 17.2 94 2.45 2.99 NA 2.29 5.60 1.240 3.37
## 53 1 13.82 . 2.42 14 111 3.88 3.74 0.32 1.87 7.05 1.010 3.26
## 54 1 13.77 1.9 NA 17.1 115 3.00 2.79 0.39 1.68 6.30 1.130 2.93
## 55 1 13.74 1.67 2.25 16.4 118 2.60 2.90 0.21 1.62 5.85 0.920 3.20
## 56 1 13.56 1.73 2.46 20.5 116 2.96 2.78 0.20 2.45 6.25 0.980 3.03
## 57 1 14.22 1.7 2.30 16.3 . 3.20 3.00 0.26 2.03 6.38 0.940 3.31
## 58 1 13.29 1.97 2.68 . 102 3.00 3.23 0.31 1.66 6.00 1.070 2.84
## 59 1 13.72 1.43 2.50 na 108 3.40 3.67 0.19 2.04 6.80 0.890 2.87
## 60 2 12.37 .94 1.36 na 88 1.98 0.57 0.28 .42 1.95 1.050 1.82
## 61 2 12.33 1.1 2.28 16 101 2.05 1.09 0.63 .41 3.27 1.250 1.67
## 62 2 12.64 1.36 2.02 16.8 100 2.02 1.41 0.53 .62 5.75 0.980 1.59
## 63 2 13.67 1.25 1.92 18 94 2.10 1.79 0.32 .73 3.80 1.230 2.46
## 64 2 12.37 1.13 2.16 19 87 3.50 3.10 0.19 1.87 4.45 1.220 2.87
## 65 2 12.17 1.45 2.53 19 104 1.89 1.75 0.45 1.03 2.95 1.450 2.23
## 66 2 12.37 1.21 2.56 18.1 98 2.42 2.65 0.37 2.08 4.60 1.190 2.30
## 67 2 13.11 1.01 1.70 15 78 2.98 3.18 0.26 2.28 5.30 1.120 3.18
## 68 2 12.37 1.17 1.92 19.6 78 2.11 2.00 0.27 1.04 4.68 1.120 3.48
## 69 2 13.34 .94 2.36 17 110 2.53 1.30 0.55 .42 3.17 1.020 1.93
## 70 2 12.21 1.19 1.75 16.8 151 1.85 1.28 0.14 2.5 2.85 1.280 3.07
## 71 2 12.29 1.61 2.21 20.4 103 1.10 1.02 0.37 1.46 3.05 0.906 1.82
## 72 2 13.86 1.51 2.67 25 86 2.95 2.86 0.21 1.87 3.38 1.360 3.16
## 73 2 13.49 1.66 2.24 24 87 1.88 1.84 0.27 1.03 3.74 0.980 2.78
## 74 2 12.99 1.67 2.60 30 139 3.30 2.89 0.21 1.96 3.35 1.310 3.50
## 75 2 11.96 1.09 2.30 21 101 3.38 2.14 0.13 1.65 3.21 0.990 3.13
## 76 2 11.66 1.88 1.92 16 97 1.61 1.57 0.34 1.15 3.80 1.230 2.14
## 77 2 13.03 .9 1.71 16 86 1.95 2.03 0.24 1.46 4.60 1.190 2.48
## 78 2 11.84 2.89 2.23 18 112 1.72 1.32 0.43 .95 2.65 0.960 2.52
## 79 2 12.33 .99 1.95 14.8 136 1.90 1.85 0.35 2.76 3.40 1.060 2.31
## 80 2 12.70 3.87 2.40 23 101 2.83 2.55 0.43 1.95 2.57 1.190 3.13
## 81 2 12.00 .92 2.00 19 86 2.42 2.26 0.30 1.43 2.50 1.380 3.12
## 82 2 12.72 1.81 2.20 18.8 86 2.20 2.53 0.26 1.77 3.90 1.160 3.14
## 83 2 12.08 1.13 2.51 24 78 2.00 1.58 0.40 1.4 2.20 1.310 2.72
## 84 2 13.05 3.86 2.32 22.5 85 1.65 1.59 0.61 1.62 4.80 0.840 2.01
## 85 2 11.84 .89 2.58 18 94 2.20 2.21 0.22 2.35 3.05 0.790 3.08
## 86 2 12.67 .98 2.24 18 99 2.20 1.94 0.30 1.46 2.62 1.230 3.16
## 87 2 12.16 1.61 2.31 22.8 90 1.78 1.69 0.43 1.56 2.45 1.330 2.26
## 88 2 11.65 1.67 2.62 26 88 1.92 1.61 0.40 1.34 2.60 1.360 3.21
## 89 2 11.64 2.06 2.46 21.6 84 1.95 1.69 0.48 1.35 2.80 1.000 2.75
## 90 2 12.08 1.33 2.30 23.6 70 2.20 1.59 0.42 1.38 1.74 1.070 3.21
## 91 2 12.08 1.83 2.32 18.5 81 1.60 1.50 0.52 1.64 2.40 1.080 2.27
## 92 2 12.00 1.51 2.42 22 86 1.45 1.25 0.50 1.63 3.60 1.050 2.65
## 93 2 12.69 1.53 2.26 20.7 80 1.38 1.46 0.58 1.62 3.05 0.960 2.06
## 94 2 12.29 2.83 2.22 18 88 2.45 2.25 0.25 1.99 2.15 1.150 3.30
## 95 2 11.62 1.99 2.28 18 98 3.02 2.26 0.17 1.35 3.25 1.160 2.96
## 96 2 12.47 1.52 2.20 19 162 2.50 2.27 0.32 3.28 2.60 1.160 2.63
## 97 2 11.81 2.12 2.74 21.5 134 1.60 0.99 0.14 1.56 2.50 0.950 2.26
## 98 2 12.29 1.41 1.98 16 85 2.55 2.50 0.29 1.77 2.90 1.230 2.74
## 99 2 12.37 1.07 2.10 18.5 88 3.52 3.75 0.24 1.95 4.50 1.040 2.77
## 100 2 12.29 3.17 2.21 18 88 2.85 2.99 0.45 2.81 2.30 1.420 2.83
## 101 2 12.08 2.08 1.70 17.5 97 2.23 2.17 0.26 1.4 3.30 1.270 2.96
## 102 2 12.60 1.34 1.90 18.5 88 1.45 1.36 0.29 1.35 2.45 1.040 2.77
## 103 2 12.34 2.45 2.46 21 98 2.56 2.11 0.34 1.31 2.80 0.800 3.38
## 104 2 11.82 1.72 1.88 19.5 86 2.50 1.64 0.37 1.42 2.06 0.940 2.44
## 105 2 12.51 1.73 1.98 20.5 85 2.20 1.92 0.32 1.48 2.94 1.040 3.57
## 106 2 12.42 2.55 2.27 22 90 1.68 1.84 0.66 1.42 2.70 0.860 3.30
## 107 2 12.25 1.73 2.12 19 80 1.65 2.03 0.37 1.63 3.40 1.000 3.17
## 108 2 12.72 1.75 2.28 22.5 84 1.38 1.76 0.48 1.63 3.30 0.880 2.42
## 109 2 12.22 1.29 1.94 19 92 2.36 2.04 0.39 2.08 2.70 0.860 3.02
## 110 2 11.61 1.35 2.70 20 94 2.74 2.92 0.29 2.49 2.65 0.960 3.26
## 111 2 11.46 3.74 1.82 19.5 107 3.18 2.58 0.24 3.58 2.90 0.750 2.81
## 112 2 12.52 2.43 2.17 21 88 2.55 2.27 0.26 1.22 2.00 0.900 2.78
## 113 2 11.76 2.68 2.92 20 103 1.75 2.03 0.60 1.05 3.80 1.230 2.50
## 114 2 11.41 .74 2.50 21 88 2.48 2.01 0.42 1.44 3.08 1.100 2.31
## 115 2 12.08 1.39 2.50 22.5 84 2.56 2.29 0.43 1.04 2.90 0.930 3.19
## 116 2 11.03 1.51 2.20 21.5 85 2.46 2.17 0.52 2.01 1.90 1.710 2.87
## 117 2 11.82 1.47 1.99 20.8 86 1.98 1.60 0.30 1.53 1.95 0.950 3.33
## 118 2 12.42 1.61 2.19 22.5 108 2.00 2.09 0.34 1.61 2.06 1.060 2.96
## 119 2 12.77 3.43 1.98 16 80 1.63 1.25 0.43 .83 3.40 0.700 2.12
## 120 2 12.00 3.43 2.00 19 87 2.00 1.64 0.37 1.87 1.28 0.930 3.05
## 121 2 11.45 2.4 2.42 20 96 2.90 2.79 0.32 1.83 3.25 0.800 3.39
## 122 2 11.56 2.05 3.23 28.5 119 3.18 5.08 0.47 1.87 6.00 0.930 3.69
## 123 2 12.42 4.43 2.73 26.5 102 2.20 2.13 0.43 1.71 2.08 0.920 3.12
## 124 2 13.05 5.8 2.13 21.5 86 2.62 2.65 0.30 2.01 2.60 0.730 3.10
## 125 2 11.87 4.31 2.39 21 82 2.86 3.03 0.21 2.91 2.80 0.750 3.64
## 126 2 12.07 2.16 2.17 21 85 2.60 2.65 0.37 1.35 2.76 0.860 3.28
## 127 2 12.43 1.53 2.29 21.5 86 2.74 3.15 0.39 1.77 3.94 0.690 2.84
## 128 2 11.79 2.13 2.78 28.5 92 2.13 2.24 0.58 1.76 3.00 0.970 2.44
## 129 2 12.37 1.63 2.30 24.5 88 2.22 2.45 0.40 1.9 2.12 0.890 2.78
## 130 2 12.04 4.3 2.38 22 80 2.10 1.75 0.42 1.35 2.60 0.790 2.57
## 131 3 12.86 1.35 2.32 18 122 1.51 1.25 0.21 .94 4.10 0.760 1.29
## 132 3 12.88 2.99 2.40 20 104 1.30 1.22 0.24 .83 5.40 0.740 1.42
## 133 3 12.81 2.31 2.40 24 98 1.15 1.09 0.27 .83 5.70 0.660 1.36
## 134 3 12.70 3.55 2.36 21.5 106 1.70 1.20 0.17 .84 5.00 0.780 1.29
## 135 3 12.51 1.24 2.25 17.5 85 2.00 0.58 0.60 1.25 5.45 0.750 1.51
## 136 3 12.60 2.46 2.20 18.5 94 1.62 0.66 0.63 .94 7.10 0.730 1.58
## 137 3 12.25 4.72 2.54 21 89 1.38 0.47 0.53 .8 3.85 0.750 1.27
## 138 3 12.53 5.51 2.64 25 96 1.79 0.60 0.63 1.1 5.00 0.820 1.69
## 139 3 13.49 3.59 2.19 19.5 88 1.62 0.48 0.58 .88 5.70 0.810 1.82
## 140 3 12.84 2.96 2.61 24 101 2.32 0.60 0.53 .81 4.92 0.890 2.15
## 141 3 12.93 2.81 2.70 21 96 1.54 0.50 0.53 .75 4.60 0.770 2.31
## 142 3 13.36 2.56 2.35 20 89 1.40 0.50 0.37 .64 5.60 0.700 2.47
## 143 3 13.52 3.17 2.72 23.5 97 1.55 0.52 0.50 .55 4.35 0.890 2.06
## 144 3 13.62 4.95 2.35 20 92 2.00 0.80 0.47 1.02 4.40 0.910 2.05
## 145 3 12.25 3.88 2.20 18.5 112 1.38 0.78 0.29 1.14 8.21 0.650 2.00
## 146 3 13.16 3.57 2.15 21 102 1.50 0.55 0.43 1.3 4.00 0.600 1.68
## 147 3 13.88 5.04 2.23 20 80 0.98 0.34 0.40 .68 4.90 0.580 1.33
## 148 3 12.87 4.61 2.48 21.5 86 1.70 0.65 0.47 .86 7.65 0.540 1.86
## 149 3 13.32 3.24 2.38 21.5 92 1.93 0.76 0.45 1.25 8.42 0.550 1.62
## 150 3 13.08 3.9 2.36 21.5 113 1.41 1.39 0.34 1.14 9.40 0.570 1.33
## 151 3 13.50 3.12 2.62 24 123 1.40 1.57 0.22 1.25 8.60 0.590 1.30
## 152 3 12.79 2.67 2.48 22 112 1.48 1.36 0.24 1.26 10.80 0.480 1.47
## 153 3 13.11 1.9 2.75 25.5 116 2.20 1.28 0.26 1.56 7.10 0.610 1.33
## 154 3 13.23 3.3 2.28 18.5 98 1.80 0.83 0.61 1.87 10.52 0.560 1.51
## 155 3 12.58 1.29 2.10 20 103 1.48 0.58 0.53 1.4 7.60 0.580 1.55
## 156 3 13.17 5.19 2.32 22 93 1.74 0.63 0.61 1.55 7.90 0.600 1.48
## 157 3 13.84 4.12 2.38 19.5 89 1.80 0.83 0.48 1.56 9.01 0.570 1.64
## 158 3 12.45 3.03 2.64 27 97 1.90 0.58 0.63 1.14 7.50 0.670 1.73
## 159 3 14.34 1.68 2.70 25 98 2.80 1.31 0.53 2.7 13.00 0.570 1.96
## 160 3 13.48 1.67 2.64 22.5 89 2.60 1.10 0.52 2.29 11.75 0.570 1.78
## 161 3 12.36 3.83 2.38 21 88 2.30 0.92 0.50 1.04 7.65 0.560 1.58
## 162 3 13.69 3.26 2.54 20 107 1.83 0.56 0.50 .8 5.88 0.960 1.82
## 163 3 12.85 3.27 2.58 22 106 1.65 0.60 0.60 .96 5.58 0.870 2.11
## 164 3 12.96 3.45 2.35 18.5 106 1.39 0.70 0.40 .94 5.28 0.680 1.75
## 165 3 13.78 2.76 2.30 22 90 1.35 0.68 0.41 1.03 9.58 0.700 1.68
## 166 3 13.73 4.36 2.26 22.5 88 1.28 0.47 0.52 1.15 6.62 0.780 1.75
## 167 3 13.45 3.7 2.60 23 111 1.70 0.92 0.43 1.46 10.68 0.850 1.56
## 168 3 12.82 3.37 2.30 19.5 88 1.48 0.66 0.40 .97 10.26 0.720 1.75
## 169 3 13.58 2.58 2.69 24.5 105 1.55 0.84 0.39 1.54 8.66 0.740 1.80
## 170 3 13.40 4.6 2.86 25 112 1.98 0.96 0.27 1.11 8.50 0.670 1.92
## 171 3 12.20 3.03 2.32 19 96 1.25 0.49 0.40 .73 5.50 0.660 1.83
## 172 3 12.77 2.39 2.28 19.5 86 1.39 0.51 0.48 .64 9899999.00 0.570 1.63
## 173 3 14.16 2.51 2.48 20 91 1.68 0.70 0.44 1.24 9.70 0.620 1.71
## 174 3 13.71 5.65 2.45 20.5 95 1.68 0.61 0.52 1.06 7.70 0.640 1.74
## 175 3 13.40 3.91 2.48 23 102 1.80 0.75 0.43 1.41 7.30 0.700 1.56
## 176 3 13.27 4.28 2.26 20 120 1.59 0.69 0.43 1.35 10.20 0.590 1.56
## 177 3 13.17 2.59 2.37 20 99999 1.65 0.68 0.53 1.46 9.30 0.600 1.62
## 178 3 14.13 4.1 2.74 24.5 96 2.05 0.76 0.56 1.35 9.20 0.610 1.60
## V14
## 1 1065
## 2 1050
## 3 1185
## 4 1480
## 5 735
## 6 1450
## 7 1290
## 8 1295
## 9 1045
## 10 1045
## 11 1510
## 12 1280
## 13 1320
## 14 1150
## 15 1547
## 16 1310
## 17 1280
## 18 1130
## 19 1680
## 20 845
## 21 780
## 22 770
## 23 1035
## 24 1015
## 25 845
## 26 830
## 27 1195
## 28 1285
## 29 915
## 30 1035
## 31 1285
## 32 1515
## 33 990
## 34 1235
## 35 1095
## 36 920
## 37 880
## 38 1105
## 39 1020
## 40 760
## 41 795
## 42 1035
## 43 1095
## 44 680
## 45 885
## 46 1080
## 47 1065
## 48 985
## 49 1060
## 50 1260
## 51 1150
## 52 1265
## 53 1190
## 54 1375
## 55 1060
## 56 1120
## 57 970
## 58 1270
## 59 1285
## 60 520
## 61 680
## 62 450
## 63 630
## 64 420
## 65 355
## 66 678
## 67 502
## 68 510
## 69 750
## 70 718
## 71 870
## 72 410
## 73 472
## 74 985
## 75 886
## 76 428
## 77 392
## 78 500
## 79 750
## 80 463
## 81 278
## 82 714
## 83 630
## 84 515
## 85 520
## 86 450
## 87 495
## 88 562
## 89 680
## 90 625
## 91 480
## 92 450
## 93 495
## 94 290
## 95 345
## 96 937
## 97 625
## 98 428
## 99 660
## 100 406
## 101 710
## 102 562
## 103 438
## 104 415
## 105 672
## 106 315
## 107 510
## 108 488
## 109 312
## 110 680
## 111 562
## 112 325
## 113 607
## 114 434
## 115 385
## 116 407
## 117 495
## 118 345
## 119 372
## 120 564
## 121 625
## 122 465
## 123 365
## 124 380
## 125 380
## 126 378
## 127 352
## 128 466
## 129 342
## 130 580
## 131 630
## 132 530
## 133 560
## 134 600
## 135 650
## 136 695
## 137 720
## 138 515
## 139 580
## 140 590
## 141 600
## 142 780
## 143 520
## 144 550
## 145 855
## 146 830
## 147 415
## 148 625
## 149 650
## 150 550
## 151 500
## 152 480
## 153 425
## 154 675
## 155 640
## 156 725
## 157 480
## 158 880
## 159 660
## 160 620
## 161 520
## 162 680
## 163 570
## 164 675
## 165 615
## 166 520
## 167 695
## 168 685
## 169 750
## 170 630
## 171 510
## 172 470
## 173 660
## 174 740
## 175 750
## 176 835
## 177 840
## 178 560
View data
# View the head of data
head(initial_data)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
## 1 1 14.23 1.71 NA 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
## 2 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
## 3 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
## 4 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
## 5 1 13.24 2.59 2.87 21 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735
## 6 1 14.20 1.76 2.45 15.2 112 3.27 3.39 0.34 1.97 6.75 1.05 2.85 1450
# View the tail of data
tail(initial_data)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
## 173 3 14.16 2.51 2.48 20 91 1.68 0.70 0.44 1.24 9.7 0.62 1.71 660
## 174 3 13.71 5.65 2.45 20.5 95 1.68 0.61 0.52 1.06 7.7 0.64 1.74 740
## 175 3 13.40 3.91 2.48 23 102 1.80 0.75 0.43 1.41 7.3 0.70 1.56 750
## 176 3 13.27 4.28 2.26 20 120 1.59 0.69 0.43 1.35 10.2 0.59 1.56 835
## 177 3 13.17 2.59 2.37 20 99999 1.65 0.68 0.53 1.46 9.3 0.60 1.62 840
## 178 3 14.13 4.1 2.74 24.5 96 2.05 0.76 0.56 1.35 9.2 0.61 1.60 560
Shape of the data
dim(initial_data)
## [1] 178 14
There are 178 rows and columns in the data.
Data pre-processing involves the following: 1. Data cleaning 2. Data integration 3. Data reduction 4. Data transformation For this this project, the focus will be mainly be on data cleaning. The framework for data cleaning are: 1. Understand the data structure. 2. Validate the fields and values. 3. Interpret statistics. 4. Visualize the data. These tasks are really important before the data is used for the model building and other requirements of the business or institution. This section will be treating missing data, outliers treatments, and plotting of graphs with statistical analysis.
str(initial_data)
## 'data.frame': 178 obs. of 14 variables:
## $ V1 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ V2 : num 14.2 13.2 13.2 14.4 13.2 ...
## $ V3 : chr "1.71" "1.78" "2.36" "1.95" ...
## $ V4 : num NA 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
## $ V5 : chr "15.6" "11.2" "18.6" "16.8" ...
## $ V6 : chr "127" "100" "101" "113" ...
## $ V7 : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
## $ V8 : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
## $ V9 : num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
## $ V10: chr "2.29" "1.28" "2.81" "2.18" ...
## $ V11: num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
## $ V12: num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
## $ V13: num 3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
## $ V14: int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
This section gives us a view of the data types in the data set. A summary of the data type is below: i. Integer (int) - 2 ii. Character - 4 iii. Numeric - 8 Total - 14. Also the variables of the data came with default names which were changed to the actually names provided in the additional text file.
Statistical summary of data
summary(initial_data)
## V1 V2 V3 V4
## Min. :1.000 Min. :11.03 Length:178 Min. :1.360
## 1st Qu.:1.000 1st Qu.:12.36 Class :character 1st Qu.:2.210
## Median :2.000 Median :13.05 Mode :character Median :2.360
## Mean :1.938 Mean :13.00 Mean :2.365
## 3rd Qu.:3.000 3rd Qu.:13.68 3rd Qu.:2.555
## Max. :3.000 Max. :14.83 Max. :3.230
## NA's :3
## V5 V6 V7 V8
## Length:178 Length:178 Min. :0.980 Min. :0.340
## Class :character Class :character 1st Qu.:1.740 1st Qu.:1.200
## Mode :character Mode :character Median :2.350 Median :2.130
## Mean :2.294 Mean :2.022
## 3rd Qu.:2.800 3rd Qu.:2.860
## Max. :3.880 Max. :5.080
## NA's :1 NA's :1
## V9 V10 V11 V12
## Min. :0.1300 Length:178 Min. : 1 Min. :0.4800
## 1st Qu.:0.2700 Class :character 1st Qu.: 3 1st Qu.:0.7825
## Median :0.3400 Mode :character Median : 5 Median :0.9650
## Mean :0.3627 Mean : 55623 Mean :0.9574
## 3rd Qu.:0.4400 3rd Qu.: 6 3rd Qu.:1.1200
## Max. :0.6600 Max. :9899999 Max. :1.7100
## NA's :1
## V13 V14
## Min. :1.270 Min. : 278.0
## 1st Qu.:1.930 1st Qu.: 500.5
## Median :2.780 Median : 673.5
## Mean :2.608 Mean : 746.9
## 3rd Qu.:3.170 3rd Qu.: 985.0
## Max. :4.000 Max. :1680.0
## NA's :1
Above is the statistical summary of variables in the data set. It gives the mean, median, maximum, minimum, 1st quartile, and 3rd quartile information for all variables in the data set. A view of this data shows that there could be possible outliers. Also the whole data set is numeric so would be changing the character data type to numeric data. The variable V11 shows misleading value of 9899999.
Assign variable name to all the variables.
# Rename column names
colnames(initial_data) <- c("Alcohol", "Malic_acid", "Ash", "Alcalinity_of_ash", "Magnesium", "Total_phenols", "Flavanoids", "Nonflavanoid_phenols", "Proanthocyanins", "Color_intensity", "Hue", "12", "OD280_OD315_of_diluted_wines", "Proline")
# Display the column names.
colnames(initial_data)
## [1] "Alcohol" "Malic_acid"
## [3] "Ash" "Alcalinity_of_ash"
## [5] "Magnesium" "Total_phenols"
## [7] "Flavanoids" "Nonflavanoid_phenols"
## [9] "Proanthocyanins" "Color_intensity"
## [11] "Hue" "12"
## [13] "OD280_OD315_of_diluted_wines" "Proline"
summary(initial_data)
## Alcohol Malic_acid Ash Alcalinity_of_ash
## Min. :1.000 Min. :11.03 Length:178 Min. :1.360
## 1st Qu.:1.000 1st Qu.:12.36 Class :character 1st Qu.:2.210
## Median :2.000 Median :13.05 Mode :character Median :2.360
## Mean :1.938 Mean :13.00 Mean :2.365
## 3rd Qu.:3.000 3rd Qu.:13.68 3rd Qu.:2.555
## Max. :3.000 Max. :14.83 Max. :3.230
## NA's :3
## Magnesium Total_phenols Flavanoids Nonflavanoid_phenols
## Length:178 Length:178 Min. :0.980 Min. :0.340
## Class :character Class :character 1st Qu.:1.740 1st Qu.:1.200
## Mode :character Mode :character Median :2.350 Median :2.130
## Mean :2.294 Mean :2.022
## 3rd Qu.:2.800 3rd Qu.:2.860
## Max. :3.880 Max. :5.080
## NA's :1 NA's :1
## Proanthocyanins Color_intensity Hue 12
## Min. :0.1300 Length:178 Min. : 1 Min. :0.4800
## 1st Qu.:0.2700 Class :character 1st Qu.: 3 1st Qu.:0.7825
## Median :0.3400 Mode :character Median : 5 Median :0.9650
## Mean :0.3627 Mean : 55623 Mean :0.9574
## 3rd Qu.:0.4400 3rd Qu.: 6 3rd Qu.:1.1200
## Max. :0.6600 Max. :9899999 Max. :1.7100
## NA's :1
## OD280_OD315_of_diluted_wines Proline
## Min. :1.270 Min. : 278.0
## 1st Qu.:1.930 1st Qu.: 500.5
## Median :2.780 Median : 673.5
## Mean :2.608 Mean : 746.9
## 3rd Qu.:3.170 3rd Qu.: 985.0
## Max. :4.000 Max. :1680.0
## NA's :1
The summary above display the change in variable name which was successful.
Change data type from character to numeric so that
initial_data$Ash <- as.numeric(initial_data$Ash)
## Warning: NAs introduced by coercion
initial_data$Magnesium <- as.numeric(initial_data$Magnesium)
## Warning: NAs introduced by coercion
initial_data$Total_phenols <- as.integer(initial_data$Total_phenols)
## Warning: NAs introduced by coercion
initial_data$Color_intensity <- as.numeric(initial_data$Color_intensity)
## Warning: NAs introduced by coercion
str(initial_data)
## 'data.frame': 178 obs. of 14 variables:
## $ Alcohol : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Malic_acid : num 14.2 13.2 13.2 14.4 13.2 ...
## $ Ash : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
## $ Alcalinity_of_ash : num NA 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
## $ Magnesium : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
## $ Total_phenols : int 127 100 101 113 118 112 96 121 97 98 ...
## $ Flavanoids : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
## $ Nonflavanoid_phenols : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
## $ Proanthocyanins : num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
## $ Color_intensity : num 2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
## $ Hue : num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
## $ 12 : num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
## $ OD280_OD315_of_diluted_wines: num 3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
## $ Proline : int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
The above confirm the changing of the character data type to numeric data type.
Working on missing data.
# Checking for missing data
initial_data_miss <- sum(is.na(initial_data))
cat (".", "\n")
## .
cat ("Missing data in this data set is : ", initial_data_miss)
## Missing data in this data set is : 18
# Identify columns with missing data
missing_cols <- colnames(initial_data)[apply(is.na(initial_data), 2, any)]
missing_cols
## [1] "Ash" "Alcalinity_of_ash"
## [3] "Magnesium" "Total_phenols"
## [5] "Flavanoids" "Nonflavanoid_phenols"
## [7] "Proanthocyanins" "Color_intensity"
## [9] "OD280_OD315_of_diluted_wines"
Above is the columns that have missing data.
Removing of missing identified missing data in the data set.
# Replace the missing data in the integer variable 'Total_phenols' with median of that column.
initial_data$Total_phenols[is.na(initial_data$Total_phenols)]<- median(initial_data$Total_phenols,na.rm = TRUE)
initial_data$Ash[is.na(initial_data$Ash)]<- mean(initial_data$Ash,na.rm = TRUE)
initial_data$Alcalinity_of_ash[is.na(initial_data$Alcalinity_of_ash)]<- mean(initial_data$Alcalinity_of_ash,na.rm = TRUE)
initial_data$Magnesium[is.na(initial_data$Magnesium)]<- mean(initial_data$Magnesium,na.rm = TRUE)
initial_data$Flavanoids[is.na(initial_data$Flavanoids)]<- mean(initial_data$Flavanoids,na.rm = TRUE)
initial_data$Nonflavanoid_phenols[is.na(initial_data$Nonflavanoid_phenols)]<- mean(initial_data$Nonflavanoid_phenols,na.rm = TRUE)
initial_data$Proanthocyanins[is.na(initial_data$Proanthocyanins)]<- mean(initial_data$Proanthocyanins,na.rm = TRUE)
initial_data$Color_intensity[is.na(initial_data$Color_intensity)]<- mean(initial_data$Color_intensity,na.rm = TRUE)
initial_data$OD280_OD315_of_diluted_wines[is.na(initial_data$OD280_OD315_of_diluted_wines)]<- mean(initial_data$OD280_OD315_of_diluted_wines,na.rm = TRUE)
Confirming that the missing data have been treated with replace with mean and median. The miss leading data (9899999) will be treated with the outlier removal.
# Checking for missing data
initial_data_miss <- sum(is.na(initial_data))
cat (".", "\n")
## .
cat ("Missing data in this data set is : ", initial_data_miss)
## Missing data in this data set is : 0
outliner_data = par(mfrow = c(1,2))
for ( i in 1:14 )
{
boxplot(initial_data[[i]], col = "green")
mtext(names(initial_data)[i], cex = 0.8, side = 1, line = 2)
}
par(outliner_data)
The plots above displays the boxplot for the 14 variables. The following variables have outliers in them: 1. Ash 2. Alcalinity of ash 3. Magnesium 4. Total phenols 5. Color intensity 6. Hue 7. OD280/OD315 of diluted wines There is no outlier in the target variable, Alcohol.
data_outliers = c()
for ( i in 1:14 )
{
stats = boxplot.stats(initial_data[[i]])$stats
b_outlier_rows = which(initial_data[[i]] < stats[1])
t_outlier_rows = which(initial_data[[i]] > stats[5])
data_outliers = c(data_outliers , t_outlier_rows[ !t_outlier_rows %in% data_outliers ] )
data_outliers = c(data_outliers , b_outlier_rows[ !b_outlier_rows %in% data_outliers ] )
}
cat("The outlier observations are:", "\n")
## The outlier observations are:
data_outliers
## [1] 124 138 174 26 122 60 67 101 74 128 2 14 70 79 96 177 111 152 159
## [20] 160 172 116
Application of Cook’s distance to detect influential observations.
mod_cook = lm(Alcohol ~ ., data = initial_data)
sd_1 = cooks.distance(mod_cook)
plot(sd_1, pch = "*", cex = 2, main = "Influential Obs by Cooks distance")
abline(h = 4*mean(sd_1, na.rm = T), col = "red")
Based on the Cook’s distance to detect influential observations, the outliers would be removed including the 1 misleading value.
c_outliers = as.numeric(rownames(initial_data[sd_1 > 4 * mean(sd_1, na.rm=T), ]))
data_outliers = c(data_outliers , c_outliers[ !c_outliers %in% data_outliers ] )
# New without outliers now called data.
data = initial_data[-data_outliers, ]
Summary of statistics to show outliers have been removed from the data. The data is now ready for additional visualizations.
# Print summary of new data.
summary(data)
## Alcohol Malic_acid Ash Alcalinity_of_ash
## Min. :1.000 Min. :11.41 Min. :0.740 Min. :1.710
## 1st Qu.:1.000 1st Qu.:12.37 1st Qu.:1.607 1st Qu.:2.237
## Median :2.000 Median :13.06 Median :1.875 Median :2.360
## Mean :1.904 Mean :13.04 Mean :2.337 Mean :2.373
## 3rd Qu.:3.000 3rd Qu.:13.71 3rd Qu.:3.132 3rd Qu.:2.540
## Max. :3.000 Max. :14.83 Max. :5.190 Max. :2.920
## Magnesium Total_phenols Flavanoids Nonflavanoid_phenols
## Min. :12.00 Min. : 70.00 Min. :0.980 Min. :0.340
## 1st Qu.:17.48 1st Qu.: 88.00 1st Qu.:1.715 1st Qu.:1.215
## Median :19.50 Median : 98.00 Median :2.310 Median :2.100
## Mean :19.44 Mean : 98.55 Mean :2.284 Mean :2.024
## 3rd Qu.:21.12 3rd Qu.:106.00 3rd Qu.:2.800 3rd Qu.:2.885
## Max. :27.00 Max. :134.00 Max. :3.880 Max. :3.930
## Proanthocyanins Color_intensity Hue 12
## Min. :0.1300 Min. :0.410 Min. : 1.280 Min. :0.5400
## 1st Qu.:0.2700 1st Qu.:1.235 1st Qu.: 3.250 1st Qu.:0.7975
## Median :0.3400 Median :1.535 Median : 4.750 Median :0.9600
## Mean :0.3591 Mean :1.539 Mean : 5.002 Mean :0.9577
## 3rd Qu.:0.4300 3rd Qu.:1.870 3rd Qu.: 6.200 3rd Qu.:1.1125
## Max. :0.6600 Max. :2.960 Max. :10.680 Max. :1.4500
## OD280_OD315_of_diluted_wines Proline
## Min. :1.270 Min. : 278.0
## 1st Qu.:2.007 1st Qu.: 507.5
## Median :2.780 Median : 675.0
## Mean :2.620 Mean : 757.6
## 3rd Qu.:3.170 3rd Qu.:1023.8
## Max. :4.000 Max. :1680.0
# Print dimension of data
dim(data)
## [1] 156 14
This section will deal with univariate, bivariate and multivariate analysis of the data set.
The diagrams below are initial histogram of the variables with the mean value of each variable.
dist_var = par(mfrow = c(1,2))
for ( i in 2:14 )
{
truehist(data[[i]], xlab = names(data)[i], col = 'lightgreen', main = paste("Average =", signif(mean(data[[i]]),3)))
}
Observations: 1. These variables were skewed to the right - Ash, Total Phenols, Proanthocyanins, Hue and Proline. 2. The rest of the data are skewed to the left per the display from the diagram.
A plot of the target (Alcohol) variable.
ggplot(initial_data, aes(x = Alcohol)) +
geom_histogram(bins = 10, position = 'identity', alpha = 0.4, fill = "blue") + labs(title = "Histogram of Alcohol variable") + geom_text(aes(label = scales::percent(..count../sum(..count..))), stat = 'count', vjust = -0.5)
The target variable has 3 classes and distributed by this percentages:
i. Class 1 - 33.1% ii. Class 2 - 39.9% iii. Class 3 - 27%
Display of pairplot of all variables.
pairs(data)
The pairplot above is not very visible for interpretation so will create
a ggplot for visibility and clarity.
Converting target variable Alcohol from integer to character data type to be able to plot.
data$Alcohol <- as.character(data$Alcohol)
ggpairs(data, columns = 2:5, aes(color = Alcohol, alpha = 0.5), upper = list(continuous = wrap("cor", size = 4)))
Observations: The data in the 4 variables are evenly distributed.
ggpairs(data, columns = 6:9, aes(color = Alcohol, alpha = 0.5), upper = list(continuous = wrap("cor", size = 4)))
Observations: The data in the 4 variables are all not evenly distributed.
ggpairs(data, columns = 10:14, aes(color = Alcohol, alpha = 0.5), upper = list(continuous = wrap("cor", size = 3)))
Displaying correlation between the variables in the data set.
data$Alcohol <- as.integer(data$Alcohol)
cor(data)
## Alcohol Malic_acid Ash
## Alcohol 1.00000000 -0.36970312 0.45461931
## Malic_acid -0.36970312 1.00000000 0.10649076
## Ash 0.45461931 0.10649076 1.00000000
## Alcalinity_of_ash -0.06352005 0.21113024 0.17134163
## Magnesium 0.56728900 -0.33315892 0.28469233
## Total_phenols -0.25621139 0.42787825 0.02075724
## Flavanoids -0.74897566 0.32909832 -0.35142113
## Nonflavanoid_phenols -0.87830665 0.30021648 -0.44515057
## Proanthocyanins 0.49596382 -0.19166875 0.29017899
## Color_intensity -0.59988675 0.19889666 -0.23022086
## Hue 0.19204566 0.56900577 0.31098169
## 12 -0.62645754 -0.00936294 -0.58806273
## OD280_OD315_of_diluted_wines -0.78847145 0.11346656 -0.38854082
## Proline -0.64491405 0.66063816 -0.18176581
## Alcalinity_of_ash Magnesium Total_phenols
## Alcohol -0.06352005 0.56728900 -0.2562113876
## Malic_acid 0.21113024 -0.33315892 0.4278782486
## Ash 0.17134163 0.28469233 0.0207572412
## Alcalinity_of_ash 1.00000000 0.30989743 0.4126787666
## Magnesium 0.30989743 1.00000000 -0.2034730000
## Total_phenols 0.41267877 -0.20347300 1.0000000000
## Flavanoids 0.12237815 -0.43213847 0.2544180011
## Nonflavanoid_phenols 0.06652652 -0.47657034 0.2119999204
## Proanthocyanins 0.07066865 0.32949206 -0.2566860344
## Color_intensity 0.04340127 -0.30857546 0.1100978370
## Hue 0.21647217 -0.04890394 0.3564850589
## 12 -0.01583350 -0.31496548 -0.0007513791
## OD280_OD315_of_diluted_wines -0.01990703 -0.35933498 0.0280598995
## Proline 0.25482074 -0.47659792 0.4423740852
## Flavanoids Nonflavanoid_phenols Proanthocyanins
## Alcohol -0.7489757 -0.87830665 0.49596382
## Malic_acid 0.3290983 0.30021648 -0.19166875
## Ash -0.3514211 -0.44515057 0.29017899
## Alcalinity_of_ash 0.1223781 0.06652652 0.07066865
## Magnesium -0.4321385 -0.47657034 0.32949206
## Total_phenols 0.2544180 0.21199992 -0.25668603
## Flavanoids 1.0000000 0.87292951 -0.49423764
## Nonflavanoid_phenols 0.8729295 1.00000000 -0.59595903
## Proanthocyanins -0.4942376 -0.59595903 1.00000000
## Color_intensity 0.6390263 0.72581281 -0.43779612
## Hue -0.0233295 -0.13209335 0.07465519
## 12 0.4573272 0.57833786 -0.25235424
## OD280_OD315_of_diluted_wines 0.6948129 0.77314613 -0.50695562
## Proline 0.5273710 0.54113981 -0.31295733
## Color_intensity Hue 12
## Alcohol -0.59988675 0.19204566 -0.6264575410
## Malic_acid 0.19889666 0.56900577 -0.0093629402
## Ash -0.23022086 0.31098169 -0.5880627320
## Alcalinity_of_ash 0.04340127 0.21647217 -0.0158335023
## Magnesium -0.30857546 -0.04890394 -0.3149654845
## Total_phenols 0.11009784 0.35648506 -0.0007513791
## Flavanoids 0.63902628 -0.02332950 0.4573272378
## Nonflavanoid_phenols 0.72581281 -0.13209335 0.5783378596
## Proanthocyanins -0.43779612 0.07465519 -0.2523542436
## Color_intensity 1.00000000 -0.00690653 0.3316192431
## Hue -0.00690653 1.00000000 -0.4503957523
## 12 0.33161924 -0.45039575 1.0000000000
## OD280_OD315_of_diluted_wines 0.59674385 -0.39936013 0.5604297231
## Proline 0.38752662 0.39641180 0.2423489538
## OD280_OD315_of_diluted_wines Proline
## Alcohol -0.78847145 -0.6449141
## Malic_acid 0.11346656 0.6606382
## Ash -0.38854082 -0.1817658
## Alcalinity_of_ash -0.01990703 0.2548207
## Magnesium -0.35933498 -0.4765979
## Total_phenols 0.02805990 0.4423741
## Flavanoids 0.69481294 0.5273710
## Nonflavanoid_phenols 0.77314613 0.5411398
## Proanthocyanins -0.50695562 -0.3129573
## Color_intensity 0.59674385 0.3875266
## Hue -0.39936013 0.3964118
## 12 0.56042972 0.2423490
## OD280_OD315_of_diluted_wines 1.00000000 0.3133982
## Proline 0.31339821 1.0000000
corrplot(cor(data))
From the diagram above, the following variables have good correlation:
1. Flavanoids 2. Nonflavanoid_phenols 3. OD280_OD315_of_diluted_wines 4.
Proline. These variables could be considered for future data processing
activities. Some of the variables also had negative correlation.
This project has been an extensive exercise of data analysis and visualization of the wine data provided. The following key observations was made during the data analysis: 1. Most of the time was spent on cleaning the data and visualizing the data. 2. The identified missing data were successfully treated. This was done by replacing them with the mean (numeric data type) or median (for integer data type) of the variable. 3. Identified outliers were successfully treated or removed. 3. In terms of correlation, 4 key variables were identified to be relevant to the data and could considered for processing in future. 4. Majority of the most of the data had their mean and median being closed to each other. After the extensive cleaning of these data and analysis, we can firm that this wine data is relevant and can be use for further research work and statistical or machine learning model building.